CMAJ Open — Latest Matching Preprints

1

Post-ED Trajectory Prediction in Abdominal Pain with a Generative Medical Event Model

McCann, K. A.; Wright, D. S.; Iscoe, M. S.; Melnick, E. R.; Ohno-Machado, L.; Meeker, D.; Venkatesh, A. K.; Sangal, R. B.; Loza, A. J.

2026-05-21 emergency medicine 10.64898/2026.05.18.26353199 medRxiv

Top 0.1%

6.4%

Importance: Abdominal pain causes roughly 10 million US emergency department (ED) visits annually, most resulting in discharge. Post-discharge courses vary, yet existing risk models predict only whether an ED revisit occurs, not what that revisit will entail. Objective: To evaluate whether Curiosity, a generative medical event foundation model, can predict post-ED-discharge trajectories for adults with abdominal pain, differentiating the timing and severity of expected outcomes. Design: Retrospective cohort study; encounters January 1-December 31, 2022; 30-day follow-up; analysis conducted in 2026. Setting: Epic Cosmos research network (multicenter, population-based, de-identified electronic health record). Participants: Adults (≥18 years) discharged from the ED with abdominal pain, excluding training-set patients. Random sample of 3,000 drawn from 150,030 eligible patients (65.3% female; median age 47 years [IQR 36-60]). Exposure: ED discharge after evaluation for abdominal pain. Main Outcomes and Measures: Primary: Curiosity model vs. per-task, separately estimated XGBoost models on area under the receiver operating characteristic curve (AUROC) for ED revisit ending in admission (admit-revisit), ED revisit ending in discharge (DC-revisit), and any ED revisit at 72 hours, 7 days, and 30 days. Secondary: trajectory-level accuracy across 36 trajectory classes and edit distance vs XGBoost; calibration of simulated vs observed conditional path probabilities across 45 transitions. Results: Curiosity identified patients at high risk of revisit requiring admission more accurately than XGBoost and differentiated those likely to revisit without admission. Among 3,000 patients, Curiosity's 30-day admit-revisit AUROC was 0.83 (95% CI 0.79-0.87) vs 0.70 (95% CI 0.65-0.75) for XGBoost (DeLong P<.001), and admit-revisit AUC-PR was 0.37 (95% CI 0.29-0.46) against a 4.1% cohort base rate, vs XGBoost 0.13 (95% CI 0.09-0.19). 
Curiosity identified the most likely trajectory out of 36 possibilities for 45.9% of patients (XGBoost 41.0%; McNemar P<.001), with median edit distance 1.28 vs 1.40 (Wilcoxon P<.001). Median absolute calibration error across 45 transitions was 1.30 percentage points (95% CI 0.32-2.49). Conclusions and Relevance: A generative medical event foundation model produced calibrated trajectory-level predictions and discriminated admit-revisits more effectively than task-specific XGBoost baselines, separating patients who revisited and were admitted from those who revisited and were discharged.

2

Cancer care disruption during the COVID-19 pandemic in Ontario, Canada: A sequential mixed-methods study

Timilshina, N.; Jacobson, D.; Birze, A.; Wodchis, W. P.; Kuluski, K.; Strumpf, E.; Ammi, M.

2026-06-12 health systems and quality improvement 10.64898/2026.06.10.26355360 medRxiv

Top 0.1%

4.9%

Introduction The COVID-19 pandemic profoundly disrupted healthcare delivery worldwide, with cancer care among the most affected services. Prior studies documented delays in referrals, reduced specialist access, and increased provider burden. However, the extent to which these experiences were reflected at the system level remains unclear. Objective To document cancer care experiences and examine whether these experiences were reflected in population-level health system indicators across Ontario, Canada. Methods We used an exploratory sequential mixed-methods design. Qualitative data were collected through focus groups and semi-structured interviews with 32 participants, including patients with cancer (n=8), caregivers (n=5), healthcare providers (n=14), and decision-makers (n=5) across two hospital settings in Ontario, Canada. Emergent themes informed the development of quantitative indicators. We then conducted a retrospective population-based analysis of linked administrative health databases for cancer patients in Ontario (n=87,786) to assess the prevalence of identified themes. Results Four themes emerged: (I) delays in diagnosis and screening; (II) disrupted access to primary care; (III) barriers to specialist and mental health services; and (IV) fragmented care for patients with multimorbidity. Quantitative findings corroborated major themes. Screening rates declined for cervical (64.8% to 57.5%) and breast cancer (64.5% to 57.2%). While in-person primary care shifted almost entirely to virtual modalities (8.5% to 95.4%), overall visit volumes remained stable. Specialist care showed uneven patterns, with increased oncology visits but declines in cardiology and mental health services. Patients with multiple comorbidities experienced the largest reductions in non-oncology specialist care. Conclusion The pandemic disrupted key components of cancer care, particularly screening, access to certain specialist services, and care for patients with complex needs. 
Integrating qualitative and quantitative evidence highlights areas of system vulnerability and underscores the need for coordinated, resilient cancer care capable of maintaining essential services during future crises.

3

Systematic Analysis of Housing Referral Outcomes in New York City's WholeYouNYC Social Care Network: Identifying Barriers to Service Connection

Conde, F.

2026-05-22 health systems and quality improvement 10.64898/2026.05.19.26353634 medRxiv

Top 0.1%

2.4%

Background: Health-related social needs (HRSNs), particularly housing instability, are significant drivers of poor health outcomes among Medicaid populations. New York State's Social Care Networks (SCNs) aim to systematically connect members to housing services through coordinated referral systems. However, limited systematic analysis of referral patterns hinders quality improvement efforts. We analyzed housing referral outcomes and workflows to identify barriers to successful service connections. Methods: We conducted a mixed-methods quality improvement study at Public Health Solutions' WholeYouNYC SCN Coordination Center. Quantitative analysis examined 4,258 housing referrals submitted between June 2025 and January 2026, extracted from the Unite Us platform via Power BI dashboard. We calculated acceptance rates, analyzed time metrics, and examined outcomes by receiving organization. Qualitative data were collected through structured consultations with 7 staff members (5 navigators, 2 supervisors) and review of internal workflow documentation. Process mapping identified workflow bottlenecks. Results: Of 4,258 housing referrals, only 45% (n=1,936) were accepted by receiving organizations, while 19% (n=815) were rejected and 32% (n=1,382) remained awaiting response with no recorded action. Average time to acceptance was 8 days for accepted referrals. Acceptance rates were consistent across top receiving organizations (44-46%), suggesting systemic rather than partner-specific barriers. Analysis of unresolved referrals revealed prolonged cases, with the longest pending 271 days. Three critical workflow bottlenecks were identified: CBO response delays, missing housing documentation, and challenges with client engagement. Conclusions: Low housing connection rates (45%) and prolonged unresolved referrals (up to 271 days) indicate systemic barriers requiring interventions at multiple levels. 
Recommendations include establishing CBO response time benchmarks, implementing automated follow-up protocols, standardizing documentation requirements, and enhancing real-time data monitoring. These findings provide an evidence-based framework for quality improvement in social care coordination programs.
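The referral shares reported in this abstract can be rechecked with a few lines of arithmetic. The sketch below uses only the counts stated above; the variable and function names are illustrative. It also shows that the three reported categories do not add up to the full 4,258 referrals, and the abstract does not say how the remainder was classified.

```python
# Counts reported in the abstract above.
total = 4258
accepted, rejected, awaiting = 1936, 815, 1382


def share(count, denominator):
    """Return count as a percentage of denominator."""
    return 100.0 * count / denominator


for label, n in [("accepted", accepted), ("rejected", rejected), ("awaiting", awaiting)]:
    print(f"{label}: {share(n, total):.1f}%")

# Referrals not covered by the three reported categories.
unclassified = total - (accepted + rejected + awaiting)
print(f"other/unreported: {unclassified} ({share(unclassified, total):.1f}%)")
```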

4

Increasing Efficiency, Persistent Burden: Longitudinal Analysis of EHR Use and After-Hours Work in Emergency Medicine Residency

Preiksaitis, C. M.; Hughes, J.; Iscoe, M.; Makutonin, M.; Rider, A.; Melnick, E.; Rose, C.

2026-05-21 medical education 10.64898/2026.05.19.26353524 medRxiv

Top 0.1%

1.8%

Objectives: Electronic Health Records (EHRs) impose a significant time burden on physicians, often requiring work to be completed outside of scheduled hours. While this burden is well-documented, how it evolves throughout emergency medicine (EM) residency remains poorly understood. This study aimed to quantify EHR usage patterns, analyze the composition of after-shift work, and characterize the development of EHR efficiency across EM training. Methods: We conducted a retrospective cohort study of EM residents (postgraduate year [PGY] 1-4) using 5.5 years of EHR audit log data (2020-2025) at a single academic institution. We analyzed EHR time per new patient encounter, stratified by postgraduate year, and categorized activities into domains such as documentation, chart review, and orders. EHR work was measured both during and after scheduled shifts. Results: The analysis included 144 unique residents and 167,010 new patient encounters across 15,386 shifts. Encounter-attributed EHR time per encounter decreased by 52% from PGY-1 to PGY-4 (median 19.9 to 9.6 minutes, p<0.001), despite an 86% increase in patient volume per shift (median 7 to 13 encounters). This efficiency gain was driven primarily by a 69% reduction in documentation time (9.3 to 2.9 minutes), accompanied by shorter notes. After-shift work (EHR activity after the 9-hour clinical shift) was present in 89.9-94.4% of encounters. At the shift level, combined after-shift EHR time (encounter-attributed plus tracking board) was a median of 64.2 minutes per shift for PGY-1 and 104.2 minutes for PGY-4. Shift-level tracking board activity dominated the after-shift burden and increased with training (median 40.2 to 79.0 minutes per shift from PGY-1 to PGY-4). Conclusions: EM residents achieve substantial gains in on-shift EHR efficiency, with the largest reductions observed in documentation time, accompanied by shorter notes and faster input speed. 
However, a persistent after-hours workload, dominated by administrative and patient flow tasks, suggests that (at least at this single institution) system-level factors--not just individual skill--may contribute to this pattern. Monitoring these objective EHR metrics may help programs identify struggling learners and evaluate the impact of interventions aimed at improving resident well-being and workflow efficiency.
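The percentage changes quoted in this abstract follow directly from the reported medians. A minimal sketch of that arithmetic is below; the helper name `pct_change` and the dictionary labels are illustrative, and the inputs are the PGY-1 and PGY-4 medians stated above.

```python
def pct_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


# (PGY-1 median, PGY-4 median) pairs taken from the abstract.
reported = {
    "EHR time per encounter (min)": (19.9, 9.6),
    "encounters per shift": (7, 13),
    "documentation time per encounter (min)": (9.3, 2.9),
}

for label, (pgy1, pgy4) in reported.items():
    print(f"{label}: {pct_change(pgy1, pgy4):+.1f}%")
```

Rounded to whole percentages, these reproduce the 52% decrease, 86% increase, and 69% decrease given in the abstract.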

5

Cross-Model Variability in Large Language Model Triage Behavior for Potential Stroke Symptoms

Dworkis, D. A.; Stenstrom, J.; Sen, A.; Lucarelli, R. T.

2026-05-25 emergency medicine 10.64898/2026.05.22.26353904 medRxiv

Top 0.1%

1.7%

Background: Stroke is a time-sensitive neurological emergency in which early EMS activation and presentation to definitive care are cornerstones of effective therapy. Large language models (LLMs) are increasingly consulted by the public for medical advice, but the veracity of the guidance provided by commercially available models responding to potential stroke symptoms is not well understood. Methods: We performed a cross-model benchmarking study comparing the triage choices of three frontier LLMs (Claude Sonnet 4.6, GPT-4o, and Llama 3.3-70b-versatile) on first-person vignettes describing a unilateral arm symptom on waking, across 10 symptom descriptors, and two clinical phases (before and after a partially reassuring self-examination), with or without a clinical distractor (n=50 per condition). Results: Claude sought emergency care most often, Llama least, and GPT-4o in between, diverging most sharply in the post-examination phase where Claude called 911 in 100% of runs, Llama called for non-emergency help in 100%, and GPT-4o was symptom-dependent. A distractor shifted behavior away from emergency care in almost all conditions: calling 911 fell from 37.9% to 14.6% and waiting rose from 0% to 45.9% in the post-examination vignette. Responses were also sensitive to symptom word: weak, limp, heavy, and clumsy generated higher alarm, whereas numb, tingly, odd, strange, and weird generated less urgent responses. Conclusions: The increasing use of LLMs for medical advice has significant public health implications. Commercially available LLMs show significant model-to-model variability and framing sensitivity when confronted with potential stroke symptoms, including under-recognition of canonical CDC warning descriptors, underscoring the need for systematic benchmarking as these tools become de facto first points of contact for patients experiencing neurological emergencies.

6

Modeling the Impact of Pediatric RSV Immunization in Massachusetts, 2024-2025

Jones, L.; Ergas, R.; Tibbs, A.; Russo, E. T.; Norville, J.; Bingay, B.; Brown, C. M.; Reich, N. G.; Pasco, R.

2026-06-10 epidemiology 10.64898/2026.06.05.26354236 medRxiv

Top 0.1%

1.5%

Background Pediatric immunizations for Respiratory Syncytial Virus (RSV), including monoclonal antibodies for infants and vaccines for pregnant people, have become broadly available and can prevent severe RSV outcomes in infants. However, quantifying the impact of RSV immunization in prevention of severe pediatric illness at the population-level is limited by lack of RSV case surveillance data. The Massachusetts Department of Public Health (DPH) conducted a modeling analysis using routine public health surveillance data to estimate the state-level impact of new RSV immunization products on Emergency Department (ED) visits and hospitalizations in Massachusetts for highest risk pediatric groups. Methods A scenario projection tool, called R.Scenario.Vax, was utilized to simulate RSV-associated ED hospital encounters by age group in the context of newly available immunizations. ED visit and hospitalization data from the National Syndromic Surveillance Program (NSSP) during the time period 10/08/2017-10/19/2024 were analyzed, scaled to account for changes in RSV testing practices over time and missing encounter volume in historic data, and utilized to inform model fit of a "typical" RSV season. RSV immunization data from the Massachusetts Immunization Information System (MIIS) for the 2023-2024 and 2024-2025 RSV seasons informed high and moderate pediatric RSV immunization coverage scenarios and their impact was compared to a counterfactual reference scenario of no new immunizations. Median projections were quantitatively and qualitatively compared to observed 2024-2025 season data. Percent reduction in hospital encounters and encounters averted per 10,000 population were calculated for each scenario as compared to the reference. Results Projections for the youngest at-risk age groups showed significantly lower RSV-associated ED visits and hospitalizations during the 2024-2025 season for both high and moderate immunization coverage scenarios. 
Median projections for infants under 6 months old in the highest coverage scenario, wherein nearly all infants were immunized, showed 72.6% lower ED visits and 73.4% lower hospitalizations when compared to the reference scenario, equating to 262 ED visits and 85 hospitalizations averted per 10,000 population. Conclusions Our results support the use of modeling methods for public health insights and suggest that RSV immunizations for infant populations result in significantly lower RSV-related ED encounters in Massachusetts.

7

Variation in Telehealth Use in a National Home Test-to-Treat Program for Acute Respiratory Infections

Losos, W.; Wang, B.; Fisher, K.; O'Connor, L.; Soni, A.; Gerber, B.

2026-05-26 health informatics 10.64898/2026.05.24.26353984 medRxiv

Top 0.1%

1.2%

Background Home Test-to-Treat (HTTT) programs deliver timely antiviral treatment for acute respiratory infections, including COVID-19 and influenza, through at-home testing and telehealth. Because access is often measured by visit occurrence, variation in how and when care is delivered may be overlooked. We hypothesized that telehealth access follows distinct process-based patterns. Methods We analyzed de-identified encounters from the national HTTT program (September 2023-July 2024); 6,213 of 8,160 eligible individuals remained after exclusions for missing data. Phenotypes were derived by k-means clustering of standardized variables capturing encounter timing, modality preference, process duration, and sociodemographic and digital access attributes. Ten-day surveys assessed symptom duration and healthcare utilization. Results Three phenotypes emerged: Delayed/Disrupted Access (n = 1,537; 24.7%), Digitally Engaged but Socioeconomically Vulnerable (n = 1,460; 23.5%), and Mainstream Access and Efficient Utilization (n = 3,216; 51.8%). Mean process duration differed (15.93 [SD 3.84] vs 3.69 [3.31] vs 2.87 [2.41] hours; p < 0.001). Synchronous preference was lowest in the Digitally Engaged group (22.9%); antiviral prescribing was high (88.6%-91.9%). Among 10-day respondents (n = 1,023), symptom duration did not differ. Emergency department visits were most frequent in the Digitally Engaged group (2.3% vs 0.0% and 0.5%; p = 0.02) and urgent care in the Delayed/Disrupted group (5.8% vs 4.1% vs 2.0%; p = 0.02). Conclusions Telehealth use in a national HTTT program formed distinct phenotypes defined by timing, modality, and care-process efficiency. Evaluating equity requires attention to how and when care is delivered, not simply whether it occurred.

8

Operational Enablers and Barriers in Hospital Incident Command: Insights from a Single-Center Table-Top Exercise at a Tertiary Care University Hospital-A Qualitative Phenomenological Study

Ries, M.; von der Forst, M.; Schaefer, H.; Bikowski, K.; Franzen, K.; Geoerg, P.; Weykamp, F.; Popp, E.; Kuellenberg, J.

2026-05-17 emergency medicine 10.64898/2026.05.13.26353139 medRxiv

Top 0.1%

0.9%

Background: In crises, hospitals must rapidly shift from routine operations to structured crisis management, requiring the activation of an incident command system. However, empirical insight into their operational functioning during activation remains limited. Goal: to identify operational enablers and barriers influencing effective crisis response. Methods: Prospective cross-sectional, qualitative, single-center study conducted after a table-top exercise within a hospital incident command system at a tertiary care university hospital (NCT06913010). Data was collected through semi-structured interviews, participant observation, and document analysis, and analyzed using a narrative-phenomenological approach. Results: Nineteen participants were included. Analysis identified nine thematic clusters shaping operational performance: (1) structure and roles; (2) communication; (3) decision-making and prioritization; (4) information management; (5) infrastructure and technology; (6) personnel and organization; (7) training, exercises, and team dynamics; (8) documentation; and (9) external communication and media. Enablers included clear role definition, structured communication, phased decision-making, and regular training. Barriers included role ambiguity, fragmented communication, insufficient prioritization, infrastructure limitations, and staffing constraints. Conclusion: Preparedness frameworks are necessary but insufficient as stand-alone approaches, as operational execution determines real-world performance. Recurring deficits included unclear roles, inconsistent communication, weak prioritization, and gaps in infrastructure and personnel. 
A limited set of standardized practices - including a clear separation of roles, leadership intent, closed-loop communication, explicit decision cycles from information gathering to structuring to decision-making, checklists, visualization, central information management, and rapid "80% decisions" - substantially enhanced performance. Mission command (Auftragstaktik) further enabled adaptive, coordinated action. Strengthening hospital incident command is a key lever for achieving system-level resilience in crises.

9

Estimating Infectious Disease Importation Risk during the 2026 FIFA World Cup

Herrera-Diestra, J. L.; Bi, K.; Ptak, S.; Ertem, Z.; Al-amery, A.; Harris, M.; Meyers, L. A.

2026-06-04 public and global health 10.64898/2026.06.03.26354828 medRxiv

Top 0.2%

0.9%

Background. The 2026 FIFA World Cup will bring an estimated 1-5 million international visitors to 11 US host cities between June 11 and July 19, 2026, the largest tournament in history. Large-scale international gatherings accelerate importation of infectious diseases from diverse source populations. Advance estimation of importation risk is essential for public health preparedness and surveillance prioritization. Methods. We developed a Poisson importation framework applied to five diseases (dengue fever, influenza, malaria, measles, and pertussis) across the 11 US venue cities. Three nested travel models of increasing resolution were constructed: a baseline model using routine June 2024 arrival data; a World Cup-adjusted model incorporating projected visitor growth factors; and a schedule-driven model routing WC fans to specific cities based on match assignments. WHO incidence and BTS T-100 routing fractions were combined with Monte Carlo uncertainty propagation (5,000 Uniform draws on under-reporting and travel-while-infectious parameters) to yield median importation estimates with 95% uncertainty intervals. Results. Dengue posed the highest importation risk at most venue cities under the schedule-driven model (median Λ > 10 expected importations from Brazil alone; 95% uncertainty interval 5.9-33.1), robust across the full literature-supported parameter range; Atlanta was the exception, where malaria probability exceeded dengue, driven by direct travel from West and Central African nations. Influenza ranked second at most cities, coinciding with the Southern Hemisphere winter peak. Pertussis showed broad geographic spread but carries the widest relative uncertainty, as the assumed detection rate sits at the upper bound of the literature range. Background tourism accounted for the dominant share of total importation risk; the World Cup fan increment contributed approximately 8.3% of projected arrivals for WC-qualified nations. Conclusions. 
This Poisson importation framework, built entirely from publicly available data, provides reproducible importation risk estimates for mass gathering events. The framework extends to additional diseases, cities, and gatherings, offering a transparent baseline complementary to proprietary modeling systems.
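To make the general shape of such a calculation concrete, the sketch below propagates uncertainty through a Poisson importation rate using 5,000 uniform draws, as the abstract describes. It is not the authors' model: the multiplicative form of the rate, the function names, and every numeric input (arrivals, incidence, parameter ranges) are hypothetical placeholders chosen only for illustration. The one standard fact used is that, for a Poisson count with mean Λ, the probability of at least one importation is 1 - exp(-Λ).

```python
import math
import random
import statistics


def simulate_importations(arrivals, reported_incidence, underreport_range,
                          travel_range, n_draws=5000, seed=0):
    """Monte Carlo estimate of the expected number of importations (Lambda).

    ASSUMED form: Lambda = arrivals * reported_incidence
                           * under-reporting multiplier
                           * probability of travelling while infectious.
    """
    rng = random.Random(seed)
    lambdas = []
    for _ in range(n_draws):
        underreport = rng.uniform(*underreport_range)  # uniform draw
        p_travel = rng.uniform(*travel_range)          # uniform draw
        lambdas.append(arrivals * reported_incidence * underreport * p_travel)
    lambdas.sort()
    median = statistics.median(lambdas)
    # Approximate 2.5th and 97.5th percentiles of the simulated rates.
    lower = lambdas[int(0.025 * n_draws)]
    upper = lambdas[int(0.975 * n_draws) - 1]
    return median, (lower, upper)


def prob_at_least_one(lam):
    """P(N >= 1) for N ~ Poisson(lam)."""
    return 1.0 - math.exp(-lam)


# Hypothetical inputs, not taken from the preprint.
median, (lo, hi) = simulate_importations(
    arrivals=50_000,
    reported_incidence=1e-4,
    underreport_range=(2.0, 10.0),
    travel_range=(0.2, 0.8),
)
print(f"median Lambda = {median:.1f}, 95% interval = ({lo:.1f}, {hi:.1f})")
print(f"P(at least one importation) at the median = {prob_at_least_one(median):.3f}")
```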

10

Professionalism Pulse: Development and Validation of a Natural Language Processing Pipeline and Dashboard for Safety Culture Surveillance in NYC Health + Hospitals

Mangut, E.; Wallace, R.

2026-05-22 health informatics 10.64898/2026.05.19.26353620 medRxiv

Top 0.2%

0.8%

Background: Professionalism and effective communication are foundational determinants of patient safety and quality of care. Unprofessional behaviors frequently serve as active precursors to adverse clinical events. However, proactive organizational surveillance is often hindered because incident feedback exists primarily as unstructured, free-text data. This study aimed to develop and validate a Natural Language Processing (NLP) pipeline and interactive dashboard to proactively monitor the "professionalism climate" within NYC Health + Hospitals, the largest municipal healthcare delivery system in the United States. Methods: A high-fidelity synthetic dataset (N=400) was computationally generated to safely mirror historical incident logs across 11 acute facilities without utilizing Protected Health Information (PHI). A rule-based NLP pipeline was developed in R utilizing the tidytext package. Unstructured narrative feedback was tokenized and classified into three core domains: Respect, Safety, and Communication. To validate the pipeline's accuracy, a 25% random stratified sample (n=100) was evaluated against independent, blinded manual coding performed by two reviewers, with inter-rater reliability measured via Cohen's kappa. Finally, an interactive Tableau dashboard was developed to operationalize and visualize these metrics for ongoing surveillance. Results: The NLP algorithm achieved an overall accuracy of 85.8% (95% CI: 79.0-92.6), with 81.2% sensitivity and 88.9% specificity. The highest domain-specific performance was observed in Communication (88.0% accuracy). Manual validation demonstrated strong inter-rater reliability (κ=0.84). Operational analysis via the dashboard revealed that 61.8% of reports occurred during the Tour 2 shift (15:00 to 23:00), aligning with peak operational volume. 
Furthermore, Respect-related feedback was reported at a disproportionately high frequency during the Tour 3 shift (23:00 to 07:00), accounting for over 50.7% of overnight feedback submissions. Conclusion: Rule-based NLP successfully transforms qualitative healthcare feedback into structured, actionable intelligence with high specificity. Integrating this pipeline into operational dashboards transitions safety culture surveillance from a reactive, manual exercise to a proactive, scalable system, enabling targeted, data-driven interventions by hospital leadership.

11

Coaching for quality improvement under performance-based contracting: a theory-of-change evaluation in Honduras

Munar, W. J.; Aranda, L. E.; Lauria, M. E.; Bernal Lara, P.; Innocenti, C.; Rodriguez, M.

2026-05-30 health systems and quality improvement 10.64898/2026.05.21.26353487 medRxiv

Top 0.2%

0.8%

Introduction. Practice coaching is increasingly used to strengthen quality improvement (QI) capacity in primary healthcare (PHC) systems in low- and middle-income countries (LMICs), yet the causal pathways through which it shifts provider behaviour, and the systemic conditions that enable or constrain those pathways, remain under-theorised. Using a theory-based qualitative evaluation, we examined how and why a practice coaching intervention influenced QI in cervical cancer screening (CCS) and antenatal care (ANC) within Honduras's decentralised PHC system during the third phase of the Salud Mesoamerica Initiative (SMI). Methods. We conducted a within-case explanatory case study. A programme theory was reconstructed before data collection and iteratively refined against evidence. Data comprised semi-structured interviews with 11 midlevel managers, 6 PHC team medical leads, and 2 regional managers, complemented by direct observation and document review. We applied combined deductive and inductive coding, thematic analysis, and pattern matching, and reported per COREQ. Results. We identified four causal patterns that refined the initial programme theory. Three were activated pathways: (1) novel professional identity among participating managers; (2) collective efficacy and data-driven learning, sustained through verifiable progress on observable indicators, strong for CCS but null for ANC, where outcomes were less attributable to teams' actions; and (3) relational coordination, psychological safety, and trust, which provided the interpersonal basis for the first two. A fourth, unanticipated pattern showed structural misalignment between coaching's enabling, learning-based logic and the directive, punitive logic of Honduras's performance-based contracting environment, confining gains to localised enabling bubbles. Conclusion. 
Coaching can activate meaningful QI pathways in LMIC primary care, but sustained, equitable impact requires deliberate alignment between coaching's learning-oriented principles and the institutional performance management architecture, and matching of coaching investment to clinical processes with observable, attributable outcomes.

12

Multinational Public Opinion on Race, Ethnicity, and Algorithmic Reform in Medicine

Adibi, A.; Le, K. X.; Pierson, E.; Diao, J. A.; Esfandiari, N.; Carlsten, C.; Sadatsafavi, M.

2026-05-21 health policy 10.64898/2026.05.15.26352687 medRxiv

Top 0.2%

0.8%

Importance: Several professional medical societies have removed race and ethnicity from widely used clinical algorithms with implications for millions of patients. Yet the opinions of patients and the public regarding the tensions underlying these pivotal changes have not been systematically explored. Objective: To assess global public opinion on the use of race or ethnicity in clinical algorithms, including preferences for different approaches to algorithmic reform and perceptions of alternative predictors. Design: Cross-sectional survey study. Setting: Multinational opt-in online survey conducted via Prolific in January 2026. Participants: A volunteer convenience sample with quota sampling to achieve approximately equal participation by sex at birth and across ten categories of self-identified race and ethnicity. Main Outcomes and Measures: Self-reported comfort with demographic and social predictors in clinical calculators, with net comfort defined as percentage extremely or somewhat comfortable minus percentage extremely or somewhat uncomfortable; preferences for race-specific versus race-free algorithms; perceptions of algorithmic harm or benefit. Results: Of 1,050 responses, 994 (94.7%) met eligibility criteria. Participants resided in 43 countries with a median age of 32.0 years (IQR, 26-41). Net comfort with the use of race or ethnicity in a hypothetical cancer risk calculator was +62.4% (95% CI: +57.8% to +66.9%), compared with +14.5% (95% CI: +9.1% to +19.9%) for postal or ZIP code. Overall, 87.9% (95% CI: 85.9% to 90.0%) were comfortable with race or ethnicity if a clinician explained its use and only 12.8% agreed race and ethnicity should never be used clinically. Across spirometry, kidney function, and cardiovascular risk calculators, 40.0% to 47.6% preferred race-specific versions, whereas 16.7% to 28.2% preferred race-free alternatives. 
Furthermore, a substantial proportion disagreed that they were well-represented by race and ethnicity categories, ranging from 22.1% for osteoporotic fracture risk equations to 42.9% for cardiovascular risk equations. These findings were consistent across countries, self-identified race and ethnicity, and among participants reporting prior experiences of racism in healthcare. Conclusions and Relevance: In our diverse multinational survey study, respondents were comfortable with the use of race and ethnicity across application areas, but often did not feel represented by existing categories and were less comfortable with the use of alternatives based on postal or ZIP codes.
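The net comfort measure defined in this abstract (percentage comfortable minus percentage uncomfortable) is simple to compute from response counts. The sketch below is illustrative only: the counts are hypothetical, the function name is invented, and the normal-approximation interval is an assumption, since the abstract does not state how its confidence intervals were derived. The variance term is the standard one for a difference of two proportions drawn from the same multinomial sample.

```python
import math


def net_comfort(n_comfortable, n_uncomfortable, n_total, z=1.96):
    """Net comfort in percentage points with an approximate 95% interval.

    n_comfortable:   respondents extremely or somewhat comfortable
    n_uncomfortable: respondents extremely or somewhat uncomfortable
    n_total:         all respondents, including neutral answers
    """
    p_c = n_comfortable / n_total
    p_u = n_uncomfortable / n_total
    net = p_c - p_u
    # Var(p_c - p_u) for two categories of one multinomial sample.
    variance = (p_c + p_u - net ** 2) / n_total
    half_width = z * math.sqrt(variance)
    return 100 * net, 100 * (net - half_width), 100 * (net + half_width)


# Hypothetical counts for illustration; only n_total matches the abstract.
net, lower, upper = net_comfort(n_comfortable=700, n_uncomfortable=200, n_total=994)
print(f"net comfort = {net:+.1f}% (95% CI {lower:+.1f}% to {upper:+.1f}%)")
```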

13

Temporal Changes in Immunization Information Systems Across U.S. States and Jurisdictions, 2000-2024

Chen, T.; Watanabe, M.; Callaghan, T.; Shioda, K.

2026-06-02 health policy 10.64898/2026.05.29.26354476 medRxiv

Top 0.2%

0.8%

Background: Statewide immunization data are essential for monitoring vaccination trends and evaluating immunization program impact. In the United States, Immunization Information Systems (IIS) were established in the early 1990s to collect these data; however, operational, legal, and procedural details vary across states and over time. This study summarized differences in IIS characteristics, such as legal requirements and reporting procedures, across U.S. states and jurisdictions over time. Methods: We analyzed survey data from previous work in 2000 and the Centers for Disease Control and Prevention (CDC) in 2012, 2018, and 2024. Our review focused on legislation and reporting requirements for immunization registries across 50 states and 14 jurisdictions, including U.S. territories and Freely Associated States. Results: Between 2000 and 2024, legal frameworks and reporting practices for immunization registries expanded across U.S. states and jurisdictions. The number of states with laws or administrative rules authorizing immunization registries increased from 24 states in 2000 to all 50 states, the District of Columbia, five metropolitan areas, five U.S. territories, and three Freely Associated States in 2024. Over the same period, reporting requirements also became more widespread. The number of states and jurisdictions mandating providers to report immunization records increased from 12 in 2000 to 54 in 2024. Consent policies also changed over time. By 2024, most states and jurisdictions had adopted implicit consent for reporting children's immunization records (41; 64%), while a smaller proportion required explicit parental consent (7; 11%) or implemented mandatory reporting without consent (14; 22%). Discussion: IIS infrastructure and reporting requirements have expanded across U.S. states and jurisdictions over the past two decades, while heterogeneity in consent policies and reporting practices persists. 
These temporal changes may need to be considered when interpreting IIS data, particularly in longitudinal and cross-jurisdictional analyses.

14

Bridging Policy and Practice: Parents and Caregivers Experiences with the Interim Canada Dental Benefit in Canada

Olatosi, O. O.; Baltus, T. H. L.; Mittermuller, B.-A.; Fux, S.; Monayao, A.; Lee, J.; Menon, A.; Yerex, K.; Goubran, S.; Schroth, R. J.

2026-05-15 public and global health 10.64898/2026.05.12.26352368 medRxiv

Top 0.2%

0.7%

Background: Access to dental care remains a significant challenge for many children in Canada, particularly among low-income and underserved populations. The Interim Canada Dental Benefit (CDB), introduced in October 2022, aimed to reduce financial barriers to oral health care for children under 12 years of age while the Canadian Dental Care Plan (CDCP) was being developed. While emerging evidence has examined program uptake, limited qualitative research has explored parents' and caregivers' experiences with the Interim CDB. Objective: This study aimed to explore parents' and caregivers' experiences with the Interim CDB in Manitoba, Canada, including awareness, access, perceived benefits, challenges, and recommendations for program improvement. Methods: A qualitative descriptive study was conducted using semi-structured interviews with 30 parents and caregivers of children under 12 years of age. Participants were recruited primarily through community dental clinics. Interviews were conducted between July 2023 and February 2024, audio-recorded, and transcribed verbatim. Data were analyzed using inductive thematic analysis to identify key themes and subthemes. Results: Seven interconnected themes were identified: (1) limited and uneven awareness of the Interim CDB; (2) inadequate and inequitable communication strategies; (3) barriers to accessing the benefit, including misconceptions about eligibility and complex application processes; (4) dental providers as key facilitators of access; (5) financial relief and improved access to care; (6) gaps in coverage and ongoing financial strain; and (7) participant-driven recommendations for improvement. While the benefit was widely perceived as reducing financial barriers and enabling access to care, challenges related to awareness, communication, and adequacy of coverage limited its overall effectiveness. 
Participants emphasized the need for improved communication from government, simplified application processes, expanded eligibility, and increased financial support. Conclusion: The Interim CDB represents an important step toward improving access to dental care for children in Canada. However, this study highlights critical implementation gaps related to awareness, accessibility, and coverage. Addressing these challenges will be essential to ensuring the success of the new CDCP and advancing equitable access to oral health care.

15

Operationalizing Eight-Dimensional Patient-Safety Risk Scoring at Scale: A Multi-Model Large Language Model Reliability Study

Lin, H.-M.; Lyu, J.; Wang, I.-L.

2026-06-01 health informatics 10.64898/2026.05.29.26354437 medRxiv

Top 0.2%

0.7%

Background: Hospital incident risk scoring has long relied on two- or three-dimensional frameworks (Severity Assessment Codes or Risk Priority Numbers), even though root cause analysis standards recognize that clinical risk is multi-factorial. The obstacle has been mainly cognitive: human reviewers cannot reliably score many dimensions across high incident volumes, so richer assessment has not been operationalized at scale. Objective: To extend the traditional three-dimensional FMEA to an eight-dimensional patient-safety risk feature framework, to establish a multi-model large language model (LLM) extraction pipeline that scores these dimensions automatically, and to demonstrate a variance-aware integer optimization (mean-variance integer programming, MV-IP) that provides a reproducible tie-breaking rule for incident prioritization under extraction uncertainty, rather than improved risk coverage. Methods: An 8-dimensional framework covering harm severity, potential harm, frequency, detectability, systemic impact, vulnerable populations, regulatory relevance, and economic impact was applied to 213 synthetic and 196 real curated incident narratives. Three independent LLMs (GPT-5.4, Gemini 3.1 Pro, Grok-4.1 Fast) from different provider families extracted structured risk scores. Inter-model consistency was assessed via ICC(A,1). Among coverage-equivalent selections, MV-IP minimized inter-model variance to give a reproducible prioritization rule. An English-language sensitivity analysis was conducted on 31 AHRQ PSNet WebM&M cases. Results: On real cases, seven of eight dimensions reached Fair or better inter-model reliability (ICC(A,1) 0.53 to 0.83); D5 (Systemic Impact) was the exception at Poor reliability (0.275), driven by little between-case variation rather than by wide model disagreement. 
Reliability was not uniform: two dimensions were Excellent (D1 actual harm 0.834, D8 economic impact 0.782), two Good, and three only Fair, so some dimensions are more readily extractable than others. The same anchors gave broadly similar results on English-language narratives. When deterministic top-K selection returned several equal-coverage solutions (11 on real cases, total inter-model variance 0.205 to 1.274), MV-IP selected the minimum-disagreement set, replacing ad hoc tie-breaking with an explicit rule without improving coverage. Bootstrap resampling found 74% to 90% of per-case variance estimates stable despite the three-model panel. Conclusions: The eight-dimensional framework operationalizes patient-safety risk features that quality teams have considered only implicitly, and three independent LLM families produced reproducible scores on most dimensions of curated narratives. Inter-model agreement, however, measures reproducibility rather than clinical correctness, and high agreement does not by itself establish that a score is right; the dimensions that are reliably extractable today (notably D6 and D8) differ from those that are not yet (D5, and to a lesser degree D4 and D7), which has direct implications for incident-reporting form design. MV-IP contributes a reproducible, variance-aware tie-breaking rule rather than improved coverage. Validation against expert-prioritized RCA lists and deployment on raw institutional incident reports remain the next steps toward clinical use.
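
The tie-breaking idea in this abstract (among selections with equal coverage, prefer the one on which the models disagree least) can be illustrated with a small sketch. This is not the authors' MV-IP formulation: it replaces the integer program with a plain enumeration over a handful of candidate sets, and the case ids, scores, function names, and the use of population variance are all assumptions made for illustration.

```python
from statistics import pvariance


def case_variance(scores_by_model):
    """Population variance of one case's scores across models (assumed measure)."""
    return pvariance(scores_by_model)


def pick_min_disagreement(candidate_sets, model_scores):
    """Among candidate selections assumed to have equal coverage,
    return the one with the lowest summed inter-model variance."""
    def total_variance(selection):
        return sum(case_variance(model_scores[case]) for case in selection)

    # The sorted case ids act as a secondary key so exact ties
    # in variance still resolve deterministically.
    return min(candidate_sets, key=lambda s: (total_variance(s), sorted(s)))


# Hypothetical data: each case id maps to scores from three models.
model_scores = {
    "A": [4, 4, 4],  # full agreement
    "B": [3, 4, 5],
    "C": [2, 5, 5],
    "D": [4, 4, 5],
}
# Hypothetical equal-coverage candidate selections.
candidates = [("A", "B"), ("A", "C"), ("A", "D")]
best = pick_min_disagreement(candidates, model_scores)
```

Here `best` is the pair whose cases carry the least summed variance, so the choice among otherwise equivalent selections follows an explicit rule rather than list order.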

16

Clinician-Centered Evaluation of Large Language Model-Generated Discharge Summaries for Longer Hospitalizations: Insights from Hospitalists and Primary Care Physicians

Osborne, T.; Mahmud, T.; Zheng, X.; Jampala, S.; Abbasi, S.; Hong, S.; Kranz, K.; Lee, S.; Ng, P.; Odekon, K.; Schachter, L.; Sexton, R.; Spinnato, T.; Tharakan, M.; Wu, Z.; Wang, F.; Wong, R.

2026-06-05 health systems and quality improvement 10.64898/2026.06.03.26354858 medRxiv

Top 0.3%

0.7%

Although large language models (LLMs) have shown promise for discharge summary generation, their value may be greater in longer hospitalizations, where increasing documentation volume and complexity increase both clinician burden and the risk of communication failures during transitions of care. Prior evaluations of LLM-generated discharge summaries have largely involved shorter stays and have rarely examined receiving-clinician priorities or incidental finding reporting. We compared LLM-generated and human-authored discharge summaries for 60 Internal Medicine hospitalizations lasting 7 to 21 days, with paired assessment by hospitalists and primary care physicians (PCPs). Clinician reviewers preferred LLM-generated summaries for 95% of encounters and rated them higher for quality, readability, factuality and completeness. PCPs, the primary recipients responsible for post-discharge care, found that LLM-generated summaries were better for understanding and communicating hospital care to patients, and providing follow-up care. LLM-generated summaries had fewer annotated errors, primarily due to fewer omissions, without increased estimated harm potential or likelihood compared with human-authored summaries. Benefits of LLM-generated summaries were especially salient for PCPs, who identified more omissions with greater downstream likelihood of harm than hospitalists. This underscores the importance of designing transition documents around the needs of clinicians assuming care post-discharge. LLM identification of radiology incidental findings was generally accurate and appropriate, suggesting potential to improve follow-up of clinically relevant findings. These findings extend prior work by demonstrating clinical value of LLMs in summarizing longer, complex hospitalizations and highlighting the value of stakeholder-centered design in clinical AI systems. 
Together, they support supervised LLM-assisted discharge summarization as a tool to reduce cognitive burden, improve documentation quality, and enhance transition-of-care communication.

17

Performance evaluation and benchmarking across 16 large language models on a comprehensive real-world emergency department triage data set

Benning, L.; Hirsch, A.; Groeschel, M.; Roeschl, T.; Spott, M.; Hans, F. P.; Urban, T.; Busch, H.-J.; Meyer, A.; Madrid, J.

2026-06-05 health informatics 10.64898/2026.05.28.26353935 medRxiv

Top 0.3%

0.6%

Background Emergency department (ED) triage is a high-stakes clinical decision process that determines patient prioritization and resource allocation under time pressure. Large language models (LLMs) have recently been proposed as decision-support tools for triage, yet most evaluations rely on simulated scenarios or curated datasets. Evidence from real-world clinical environments remains limited. The objective of this project was to systematically evaluate the performance, calibration, and reproducibility of multiple contemporary large language models for Emergency Severity Index (ESI) classification and sectoral allocation (ED vs. urgent care practice, UCP) using a comprehensive real-world triage dataset. Material and Methods Retrospective cross-sectional benchmarking study conducted at a tertiary academic ED in Germany with an integrated central point of assessment (CPA). The study included all consecutive adult walk-in encounters (>18 years) presenting between October 2023 and February 2024 (N = 16,107). Data were collected from a structured clinical decision support system capturing presenting complaints, vital signs, and triage decisions recorded by specialized nursing staff. Structured clinical variables routinely collected at triage, including presenting complaint categories (CEDIS-PCL), vital signs according to the ABCDE framework, and additional structured or free-text clinical information. Results The primary outcome was the agreement between LLM-predicted and nurse-assigned ESI levels measured using quadratic-weighted Cohen's κ. Secondary outcomes included sectoral assignment agreement, misclassification patterns (over- and under-triage), calibration metrics, and output reproducibility. Quadratic-weighted κ values ranged from 0.18 to 0.75 across models. Only a structured stepwise prompting strategy achieved substantial agreement (κ_qw = 0.747), approaching reported human inter-rater reliability. 
Most models demonstrated moderate or lower agreement and systematic overconfidence, with expected calibration errors (ECE) based on verbalized confidence ranging from 0.099 to 0.355. Sectoral assignment agreement (i.e. ED vs. urgent care practice, UCP) was uniformly low (κ < 0.30). Reproducibility testing revealed substantial variability in 23% of cases, indicating non-deterministic output behavior for clinically relevant decisions. Conclusions Current large language models demonstrate heterogeneous and generally limited performance in real-world emergency triage tasks. Structured algorithm-guided prompting appears more influential than model architecture or size. Before clinical implementation, improvements in calibration, reliability, and workflow integration are required, alongside regulatory-compliant validation in prospective clinical settings.
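
For readers unfamiliar with the primary outcome, the following is a minimal sketch of quadratic-weighted Cohen's kappa for ordinal labels such as ESI levels 1 to 5. It uses the standard textbook definition and is not taken from the study's code; the function name and the example ratings are hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_levels):
    """Cohen's kappa with quadratic disagreement weights.

    Labels are assumed to be integers 1..n_levels (n_levels >= 2).
    """
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("rating lists must be non-empty and of equal length")

    # Observed joint counts, then row and column marginals.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            # Quadratic weight: 0 on the diagonal, 1 at maximal distance.
            w = (i - j) ** 2 / (n_levels - 1) ** 2
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when expected disagreement is zero")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ratings on a five-level scale.
nurse = [1, 2, 3, 4, 5]
kappa_same = quadratic_weighted_kappa(nurse, [1, 2, 3, 4, 5], 5)
kappa_reversed = quadratic_weighted_kappa(nurse, [5, 4, 3, 2, 1], 5)
```

Because the weights grow with the squared distance between levels, a prediction two levels away from the nurse-assigned level is penalized four times as heavily as one that is a single level away, which is why this variant suits ordinal triage scales.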

18

Improving bystander automated external defibrillation application in Singapore: An 11-year population-based living-laboratory study

Bokman, J. T.; Singapore PAROS Investigators; Ee, S.; Fook-Chong, S. M. C.; Binte Ahmad, N. S.; Leong, B. S.; Chia, M. Y. C.; Okada, Y.; Ong, M. E. H.; Siddiqui, F. J.

2026-05-22 emergency medicine 10.64898/2026.05.20.26353744 medRxiv

Top 0.3%

0.5%

Background Bystander automated external defibrillator (BAED) use improves out-of-hospital cardiac arrest (OHCA) outcomes but remains uncommon globally. This study evaluated the outcomes of Singapore's 11-year public-access AED expansion and volunteer-responder implementation in terms of trends in BAED use, associated factors, and clinical outcomes. Methods This population-based, retrospective cohort study used Singapore Pan-Asian Resuscitation Outcomes Study (SG-PAROS) data (2010-2020) for adult, non-traumatic OHCAs. The primary outcome was bystander AED application. Multivariable logistic regression identified factors associated with use. Secondary outcomes included favorable neurological status (CPC 1-2), survival to discharge, and prehospital return of spontaneous circulation (ROSC). Results Of 21,439 included OHCA cases (median age 70.0 years; 63.8% male), BAED use increased from 1.7% to 9.6% over 11 years, with a corresponding increase in overall survival from 2.4% to 4.0%. Malay ethnicity (aOR 1.25, 1.06-1.49), calendar year (aOR 1.26, 1.22-1.29), and delayed emergency medical services (aOR 1.24, 1.06-1.45) were positive predictors of BAED use. Conversely, BAED use was lower among females (aOR 0.80, 95% CI 0.69-0.94), at night (aOR 0.69, 0.56-0.86), and in residential settings (aOR 0.06, 0.05-0.07). Volunteer arrival strongly increased application (aOR 4.16, 3.41-5.09), with a significant interaction (p<0.001); the effect was greater in residential (aOR 7.38, 5.81-9.38) than non-residential settings (aOR 1.71, 1.22-2.40). AED use predicted favorable neurological outcome (aOR 2.80, 2.24-3.50; NNT 8.7), survival (aOR 2.30, 1.89-2.80), and ROSC (aOR 2.11, 1.81-2.46). Conclusion Over 11 years, we saw a significant increase in BAED application and favorable neurological survival. This success was associated with the implementation of an integrated strategy combining widespread AED deployment, national training, and smartphone-activated volunteer responders. 
Singapore's experience provides a scalable model for urban centers seeking to expand their AED strategy.

19

Real-time Computer Vision Assisted Navigation for Endoscopic Pituitary Surgery: Iterative Development and Comparative Preclinical Evaluation

Khan, D. Z.; Mao, Z.; Hudson, G.; Wijekoon, A.; Chen, J.-e.; Borg, A.; Dorward, N.; Blandford, A.; Clarkson, M.; McCulloch, P.; Bano, S.; Stoyanov, D.; Marcus, H.

2026-06-04 surgery 10.64898/2026.06.02.26354760 medRxiv

Top 0.3%

0.5%

Background Endoscopic pituitary surgery involves navigating high-stakes anatomy where complications, such as carotid artery injury, cause devastating morbidity. While computer vision AI offers potential for real-time anatomical recognition to mitigate these risks, successful translation requires rigorous human-factors and performance evaluation. We present the iterative development and preclinical evaluation of a surgeon-controlled, real-time AI-assisted navigation system. Methods Guided by IDEAL Stage 0 and DECIDE-AI frameworks, the study was conducted in two phases. Phase 1 was an exploratory study where surgeons used the system during high-fidelity simulated surgery and provided feedback via "Think Aloud" protocols and surveys. Following prototype iteration, a Phase 2 randomized crossover comparative trial was conducted with 19 neurosurgeons (15 trainees, 4 experts) performing high-fidelity simulated tumour resections with and without AI assistance, separated by a minimum 2-week washout. The primary outcome was surgical technical performance (OSATS). Workload, educational value, usability, trust, and implementation outcomes were also assessed. Results Phase 1 informed hardware, model, and interface refinements, including optimized pedal-controlled overlays and prediction confidence metrics. In the comparative trial, AI assistance significantly improved overall technical performance (OSATS 19.79 ± 4.06 vs. 17.32 ± 4.11; p=0.027). This gain was experience-dependent; AI significantly augmented trainee performance (19.20 ± 3.76 vs. 16.60 ± 3.78), narrowing the proficiency gap, while expert performance remained high and stable. 100% of participants identified the system as a useful training tool. However, subjective workload was significantly higher in the AI arm (SURG-TLX 26.42 ± 9.56 vs. 22.26 ± 7.81; p=0.014). Despite this, usability (SUS 75.13 ± 14.31) and implementation feasibility, acceptability, and appropriateness scores were consistently high (means >4.4/5). 
Conclusions This study provides a stepwise process for real-time AI development using pituitary surgery as a high-stakes exemplar. The refined surgeon-centric AI system improves training and technical performance, particularly for trainees. Next steps involve first-in-human studies and further exploration of longer-term human factors such as over-reliance, cognitive overload mitigation and trust calibration.

20

Closing the gaps: Improving physical health diagnosis in the emergency department for patients with mental health conditions

Jayaprakash, A.; Liberati, E.; Lindsay, R.; Willars, J.; Gibson, J.; Fritz, Z.; Price, A.; Hatfield, T.; Richards, N.; Martin, G.

2026-06-08 emergency medicine 10.64898/2026.06.05.26354970 medRxiv

Top 0.3%

0.5%

Objectives People with mental health conditions experience increased rates of diagnostic errors and delays in acute treatment. While causes such as diagnostic overshadowing (misattribution of physical symptoms to mental health conditions) are well documented, less attention has been paid to the organisational and structural conditions that shape diagnostic work. This study examines how physical illness is diagnosed in patients with mental health conditions in emergency departments (EDs), with a focus on the structural conditions that enable or constrain safe diagnostic practice. Method We conducted a multi-site ethnography across three purposively selected EDs in England between April 2023 and April 2024, varying in size, population demographics, and local service configuration. Data were collected through 284 hours of non-participant observation and 20 semi-structured interviews with ED staff. Results Our analysis identified four recurring structural gaps that shaped the conditions under which physical health diagnosis took place for patients with mental health conditions: a design gap, whereby targets and physical layouts constrained diagnostic reasoning; a preparedness gap, reflecting the lack of structural support to allow staff to act on their existing knowledge and skills; a coordination gap, reflecting fragmented ownership and the challenges of joint assessment across mental and physical healthcare teams; and an expectation gap, whereby unmet need elsewhere in the system increased demand for ED services that were beyond its formal scope. These gaps made diagnostic errors and delay more likely for patients with mental health conditions seeking physical healthcare in the ED. Conclusions As new dedicated mental health EDs are introduced in England, there is an opportunity to avoid reproducing these structural gaps in new settings. 
Our study suggests that improving physical healthcare for patients with mental health conditions requires changes to how EDs are designed, resourced and supported, and how they connect with the wider health and care system. Keywords: mental health, diagnostic inequality, emergency departments